COMPUTER SCIENCE & TECHNOLOGY:


Data  Compression 
A  Comparison 
of  Methods 




NBS  Special  Publication  500-12 

U.S.  DEPARTMENT  OF  COMMERCE 
National  Bureau  of  Standards 


NATIONAL  BUREAU  OF  STANDARDS 


The National Bureau of Standards¹ was established by an act of Congress March 3, 1901. The Bureau's overall goal is to
strengthen  and  advance  the  Nation's  science  and  technology  and  facilitate  their  effective  application  for  public  benefit.  To  this 
end,  the  Bureau  conducts  research  and  provides:  (1)  a  basis  for  the  Nation's  physical  measurement  system,  (2)  scientific  and 
technological  services  for  industry  and  government,  (3)  a  technical  basis  for  equity  in  trade,  and  (4)  technical  services  to  pro- 
mote public  safety.  The  Bureau  consists  of  the  Institute  for  Basic  Standards,  the  Institute  for  Materials  Research,  the  Institute 
for  Applied  Technology,  the  Institute  for  Computer  Sciences  and  Technology,  the  Office  for  Information  Programs,  and  the 
Office  of  Experimental  Technology  Incentives  Program. 

THE  INSTITUTE  FOR  BASIC  STANDARDS  provides  the  central  basis  within  the  United  States  of  a  complete  and  consist- 
ent system  of  physical  measurement;  coordinates  that  system  with  measurement  systems  of  other  nations;  and  furnishes  essen- 
tial services  leading  to  accurate  and  uniform  physical  measurements  throughout  the  Nation's  scientific  community,  industry, 
and  commerce.  The  Institute  consists  of  the  Office  of  Measurement  Services,  and  the  following  center  and  divisions: 

Applied Mathematics — Electricity — Mechanics — Heat — Optical Physics — Center for Radiation Research — Laboratory
Astrophysics² — Cryogenics² — Electromagnetics² — Time and Frequency².

THE  INSTITUTE  FOR  MATERIALS  RESEARCH  conducts  materials  research  leading  to  improved  methods  of  measure- 
ment, standards,  and  data  on  the  properties  of  well-characterized  materials  needed  by  industry,  commerce,  educational  insti- 
tutions, and  Government;  provides  advisory  and  research  services  to  other  Government  agencies;  and  develops,  produces,  and 
distributes  standard  reference  materials.  The  Institute  consists  of  the  Office  of  Standard  Reference  Materials,  the  Office  of  Air 
and  Water  Measurement,  and  the  following  divisions: 

Analytical  Chemistry  —  Polymers  —  Metallurgy  —  Inorganic  Materials  —  Reactor  Radiation  —  Physical  Chemistry. 

THE  INSTITUTE  FOR  APPLIED  TECHNOLOGY  provides  technical  services  developing  and  promoting  the  use  of  avail- 
able technology;  cooperates  with  public  and  private  organizations  in  developing  technological  standards,  codes,  and  test  meth- 
ods; and  provides  technical  advice  services,  and  information  to  Government  agencies  and  the  public.  The  Institute  consists  of 
the  following  divisions  and  centers: 

Standards  Application  and  Analysis  —  Electronic  Technology  —  Center  for  Consumer  Product  Technology:  Product 
Systems  Analysis;  Product  Engineering  —  Center  for  Building  Technology:  Structures,  Materials,  and  Safety;  Building 
Environment;  Technical  Evaluation  and  Application  —  Center  for  Fire  Research:  Fire  Science;  Fire  Safety  Engineering. 

THE  INSTITUTE  FOR  COMPUTER  SCIENCES  AND  TECHNOLOGY  conducts  research  and  provides  technical  services 
designed  to  aid  Government  agencies  in  improving  cost  effectiveness  in  the  conduct  of  their  programs  through  the  selection, 
acquisition, and effective utilization of automatic data processing equipment; and serves as the principal focus within the
executive branch for the development of Federal standards for automatic data processing equipment, techniques, and computer
languages. The Institute consists of the following divisions:

Computer  Services  —  Systems  and  Software  —  Computer  Systems  Engineering  —  Information  Technology. 

THE  OFFICE  OF  EXPERIMENTAL  TECHNOLOGY  INCENTIVES  PROGRAM  seeks  to  affect  public  policy  and  process 
to  facilitate  technological  change  in  the  private  sector  by  examining  and  experimenting  with  Government  policies  and  prac- 
tices in  order  to  identify  and  remove  Government-related  barriers  and  to  correct  inherent  market  imperfections  that  impede 
the  innovation  process. 

THE  OFFICE  FOR  INFORMATION  PROGRAMS  promotes  optimum  dissemination  and  accessibility  of  scientific  informa- 
tion generated  within  NBS;  promotes  the  development  of  the  National  Standard  Reference  Data  System  and  a  system  of  in- 
formation analysis  centers  dealing  with  the  broader  aspects  of  the  National  Measurement  System;  provides  appropriate  services 
to  ensure  that  the  NBS  staff  has  optimum  accessibility  to  the  scientific  information  of  the  world.  The  Office  consists  of  the 
following  organizational  units: 

Office  of  Standard  Reference  Data  —  Office  of  Information  Activities  —  Office  of  Technical  Publications  —  Library  — 
Office  of  International  Standards  —  Office  of  International  Relations. 

1 Headquarters and Laboratories at Gaithersburg, Maryland, unless otherwise noted; mailing address Washington, D.C. 20234.
2  Located  at  Boulder,  Colorado  80302. 


COMPUTER SCIENCE & TECHNOLOGY:


Data  Compression  — 
A Comparison of Methods


Jules  Aronson 


Institute  for  Computer  Sciences  and  Technology 
National  Bureau  of  Standards 
Washington,  D.C.  20234 


U.S.  DEPARTMENT  OF  COMMERCE,  Juanita  M.  Kreps,  Secretary 

Dr.  Sidney  Harman,  Under  Secretary 

Jordan  J.  Baruch,  Assistant  Secretary  for  Science  and  Technology 
NATIONAL  BUREAU  OF  STANDARDS,  Ernest  Ambler,  Acting  Director 


Issued  June  1977 


Reports  on  Computer  Science  and  Technology 


The  National  Bureau  of  Standards  has  a  special  responsibility  within  the  Federal 
Government  for  computer  science  and  technology  activities.  The  programs  of  the 
NBS  Institute  for  Computer  Sciences  and  Technology  are  designed  to  provide  ADP 
standards,  guidelines,  and  technical  advisory  services  to  improve  the  effectiveness  of 
computer utilization in the Federal sector, and to perform appropriate research and
development  efforts  as  foundation  for  such  activities  and  programs.  This  publication 
series  will  report  these  NBS  efforts  to  the  Federal  computer  community  as  well  as  to 
interested  specialists  in  the  academic  and  private  sectors.  Those  wishing  to  receive 
notices  of  publications  in  this  series  should  complete  and  return  the  form  at  the  end 
of  this  publication. 


National  Bureau  of  Standards  Special  Publication  500-12 

Nat. Bur. Stand. (U.S.), Spec. Publ. 500-12, 39 pages (June 1977)
CODEN:  XNBSAV 


Library  of  Congress  Cataloging  in  Publication  Data 

Aronson,  Jules. 

Data  compression  —  a  comparison  of  methods. 

(Computer  science  &  technology)  (National  Bureau  of  Standards 
special  publication  ;  500-12) 

Bibliography:  p. 

Supt.  of  Docs,  no.:  C13.10:500-12 

1. Data compression (Computer science) 2. Coding theory. I. Title.
II. Series. III. Series: United States. National Bureau of Standards.
Special  publication  ;  500-12. 

QC100.U57 no. 500-12 [QA76.9.D33] 602'.1s [001.6'425] 77-608132


U.S.  GOVERNMENT  PRINTING  OFFICE 
WASHINGTON:  1977 


For  sale  by  the  Superintendent  of  Documents,  U.S.  Government  Printing  Office.  Washington,  D.C.  20402  -  Price  $1.50 

Stock  No.  003-003-01797-3 


TABLE OF CONTENTS


1.  Introduction
2.  Survey of Data Compression Techniques
    2.1  Null Suppression
    2.2  Pattern Substitution
    2.3  Statistical Encoding
    2.4  Telemetry Compression
3.  Analysis of Data Compression
    3.1  Noiseless Coding Problem
         3.1.1  Uniquely Decipherable Codes
         3.1.2  Optimal Codes
    3.2  Realization of Optimal Codes
    3.3  Synthesis of the Huffman Code
4.  CONCLUSIONS
5.  BIBLIOGRAPHY


Acknowledgments 


I wish to acknowledge the help furnished by Beatrice
Marron and Dennis W. Fife. With the encouragement and
assistance of both, but especially Ms. Marron, the ideas and
style of the paper were developed.




Data  Compression  -  A  Comparison  of  Methods. 
Jules  P.  Aronson 


One  important  factor  in  system  design  and  in 
the  design  of  software  is  the  cost  of  storing 
data. Methods that reduce storage space can, besides
reducing storage cost, be a critical factor
in  whether  or  not  a  specific  application  can  be 
implemented.  This  paper  surveys  data  compression 
methods  and  relates  them  to  a  standard  statistical 
coding  problem  -  the  noiseless  coding  problem.  The 
well  defined  solution  to  that  problem  can  serve  as 
a  standard  on  which  to  base  the  effectiveness  of 
data  compression  methods.  A  simple  measure,  based 
on  the  characterization  of  the  solution  to  the 
noiseless  coding  problem,  is  stated  through  which 
the  effectiveness  of  a  data  compression  method  can 
be  calculated.  Finally,  guidelines  are  stated  con- 
cerning the  relevance  of  data  compression  to  data 
processing  applications. 

Key  words:  Coding;  Coding  Theory;  Computer 
Storage;  Data  Compaction;  Data  Compression;  Data 
Elements;  Data  Management;  Data  Processing; 
Information  Management;   Information  Theory. 


1.  Introduction 


The  purpose  of  this  report  is  to  assist  Federal  Agen- 
cies in  developing  data  element  standards  that  are  both  com- 
patible within  the  Federal  government  and  economical. 
Specifically,  this  report  responds  to  the  GAO  recommenda- 
tions that  the  Department  of  Commerce  "...  issue  policy, 
delineating  accepted  theory  and  terminology,  and  provide  for 
preparation  of  guidelines,  methodology,  and  criteria  to  be 
followed  by  agencies  in  their  standards  efforts"*.  This  re- 
port delineates  the  theory  and  terminology  of  data  compres- 
sion and  surveys  classes  of  data  compression  techniques. 


*  GAO  report  B-115369;  Emphasis  Needed  On  Government's 
Efforts  To  Standardize  Data  Elements  And  Codes  For 
Computer  Systems;  May  16,   1974;  p33. 


Data element standards activities in the past have been
concerned with abbreviations or codes for specific terms,
such as the names of countries, metropolitan areas, and
states. The purpose of such representations has been to
reduce the space necessary to store such terms, while
maintaining the ability to reproduce the terms from the
representations. While each representation in a given class
is unique, interclass uniqueness is not necessarily
maintained. For example, the standard abbreviation for
CALIFORNIA is CA (1), but the abbreviation for CANADA is
also CA (2). The use of standard codes creates similar
problems. The code for the geographical area of Alameda
County, California is 06001 (3), while that for the standard
metropolitan statistical area of Augusta, Georgia is
0600 (4). To distinguish between these two codes, whenever
they occur in the same file, is complicated and sometimes
impossible, since these codes violate the coding principle
that no code be a prefix of another (5). Decoding the above
two codes involves the inefficient process of backtracking
through the message stream after it has been received.
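
The prefix violation described above can be detected mechanically. The following is a minimal sketch in Python; the function name prefix_conflicts is an illustrative assumption and is not taken from any FIPS publication. It reports every pair of codes in which one code is the beginning of another.

```python
from itertools import permutations

def prefix_conflicts(codes):
    """Return (shorter, longer) pairs in which the first code is a prefix of the second."""
    return [(a, b) for a, b in permutations(codes, 2)
            if a != b and b.startswith(a)]

# The two codes cited in the text: Alameda County and the Augusta SMSA.
print(prefix_conflicts(["06001", "0600"]))   # [('0600', '06001')]
```

A set of codes for which this function returns an empty list can be decoded from left to right without backtracking.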

The  reduction  in  storage,  effected  by  the  use  of  data 
representations,  is  not  as  great  as  the  reduction  that  can 
be  accomplished  by  the  use  of  uniform  and  systematic  tech- 
niques of  data  compression.  This  report  describes  methods 
which  uniformly  compress  the  data,  rather  than  a  select  set 
of  terms.  These  methods  may  be  used  to  replace  standard 
representations  or  may  be  applied  to  data  in  which  some 
terms  are  already  so  represented.  These  methods  could 
reduce  the  high  cost  of  computer  operations  by  eliminating 
unnecessary  incompatibilities  in  the  representation  of  data 
and  by  reducing  the  cost  of  storing  the  data. 

The  cost  of  storing  data  is  a  very  significant  part  of 
the  total  computer  system  cost.  This  cost  is  composed  of  the 
direct  charges  for  the  storage  media,  such  as  disk  devices, 
as well as the costs of transferring the data to and from
local and remote storage devices. The latter costs are in
turn composed of the costs of the data channels and, for
remotely stored data, the network, both of which must have
sufficient bandwidth to transmit the data. Data compression
results in cost savings by reducing the amount of
storage required to store data files. In addition, data


(1) Nat. Bur. Stand., Fed. Info. Process. Stand. Publ.
(FIPS  PUB)  5-1 

(2)  FIPS  PUB  10-1 

(3)  FIPS  PUB  6-2 

(4)  FIPS  PUB  8-4 

(5)  see  section  3.1.1 
compression methods may enable more efficient information
retrieval operations as well as more economical transmission
of large amounts of data over computer networks. There are
several  types  of  data  compression  techniques  which  range 
from  the  suppression  of  null  characters  to  pattern  substitu- 
tion and  statistical  coding. 

In  this  report  several  types  of  data  compression  tech- 
niques are  discussed  along  with  descriptions  of  some  of 
their  implementations.  Then,  the  data  compression  problem 
is  analyzed  with  respect  to  a  classification  of  compression 
schemes  in  terms  of  the  functional  attributes  of  domain, 
range,  and  operation.  In  addition,  concepts  from  informa- 
tion theory  are  introduced,  in  part  3,  to  give  the  reader  a 
perspective  from  which  to  clarify  and  measure  the  perfor- 
mance of  compression  techniques.  From  information  theory 
the  compression  problem  may  be  seen  as  an  aspect  of  the  more 
general  noiseless  coding  problem.  The  mathematical  portions 
of  part  3  may  be  skipped  without  seriously  affecting  the 
meaning  of  this  report.  Finally,  some  criteria  for  the 
selection  of  techniques  are  discussed  with  regard  to  the 
form and application of the data structure.


2.     Survey  of  Data  Compression  Techniques 


2.1     Null  Suppression 

Null  suppression  techniques  encompass  those  methods 
which  suppress  zeros,  blanks,  or  both.  This  type  of  compres- 
sion could  be  called  the  de  facto  standard  method  for 
compressing  data  files.  It  takes  advantage  of  the  pre- 
valence of  blanks  and  zeros  in  some  data  files,  and  is  easy 
and economical to implement. Null suppression may not,
however, achieve as high a compression ratio as some
other techniques. Its obvious application is to card image
data  records  which  formed  the  basic  data  structure  of  many 
of  the  earlier  data  management  systems. 

One way of implementing null suppression is through the
use of a bit map in which a one indicates a non-null data
item and a zero indicates a null item. This method is
applicable to data files having fixed size units, such as words
or bytes. Figure 1 illustrates the method, where a bit map
is placed in front of a collection of items. Units
containing all nulls are dropped from the collection and the
bit which corresponds to such units is set to zero.

     Original Data            Compressed Data

     Data 1                   10000100000110   (bit map)
     0                        Data 1
     0                        Data 2
     0                        Data 3
     0                        Data 4
     Data 2
     0
     0
     0
     0
     0
     Data 3
     Data 4
     0


Figure 1     Zero Suppression Using a Bit Map
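
The bit map method of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names bitmap_compress and bitmap_expand are hypothetical, the bit map is held as a string of "0" and "1" characters rather than packed bits, and a unit is taken to be null when it equals 0.

```python
def bitmap_compress(units, null=0):
    """Drop null units and record their positions in a bit map (1 = kept, 0 = null)."""
    bitmap = "".join("0" if u == null else "1" for u in units)
    kept = [u for u in units if u != null]
    return bitmap, kept

def bitmap_expand(bitmap, kept, null=0):
    """Rebuild the original units from the bit map and the non-null units."""
    it = iter(kept)
    return [next(it) if bit == "1" else null for bit in bitmap]

# The fourteen units shown in Figure 1.
original = ["Data 1", 0, 0, 0, 0, "Data 2", 0, 0, 0, 0, 0, "Data 3", "Data 4", 0]
bitmap, kept = bitmap_compress(original)
print(bitmap)   # 10000100000110
print(kept)     # ['Data 1', 'Data 2', 'Data 3', 'Data 4']
```

In a real file the bit map would be packed eight units to a byte, so the overhead is one bit per unit regardless of how many units are null.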


Another  way  to  implement  null  suppression  is  the  run 
length  technique  shown  in  figure  2.  A  special  character  is 
inserted  to  indicate  a  run  of  nulls.  Following  that  charac- 
ter is  a  number  to  indicate  the  length  of  the  run.  The 
choice  of  the  special  character  depends  on  the  code  used  to 
represent  the  data.  For  codes  such  as  ASCII  or  EBCDIC  a 
good  choice  is  one  of  the  characters  which  does  not  occur  in 
the  data,  of  which  there  are  many  in  these  codes.  If  the 
character  set  contains  no  unused  characters,  such  as  in  the 
six-bit  codes,  the  technique  may  still  be  used  by  selecting 
an  infrequently  used  character  and  doubling  it  when  it  oc- 
curs as  part  of  the  data. 

Original Data:      Item A10000X02500000NƀƀƀƀƀCOST
Compressed Data:    Item A1#4X025#5N%5COST

(in the above ƀ = blank)


Figure 2    Run Length Coding
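
The run length technique of Figure 2 can be sketched as follows. The sketch assumes, as in the figure, that "#" marks a run of zeros and "%" marks a run of blanks, that neither marker occurs in the data, and that the run length is a single decimal digit. The minimum run length of 3 and the function names are assumptions made for this illustration only.

```python
import re

MARKERS = {"0": "#", " ": "%"}   # null character -> run marker (assumed absent from the data)

def rle_compress(text):
    """Replace runs of 3 to 9 zeros or blanks with a marker and a one-digit run length."""
    def encode(match):
        run = match.group(0)
        return MARKERS[run[0]] + str(len(run))
    return re.sub(r"0{3,9}| {3,9}", encode, text)

def rle_expand(text):
    """Invert rle_compress: a marker followed by a digit becomes a run of nulls."""
    nulls = {marker: null for null, marker in MARKERS.items()}
    return re.sub(r"([#%])(\d)", lambda m: nulls[m.group(1)] * int(m.group(2)), text)

original = "Item A10000X02500000N     COST"
compressed = rle_compress(original)
print(compressed)   # Item A1#4X025#5N%5COST
```

Runs shorter than three characters are left alone because the marker and count would occupy as much space as the run itself.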


2.2     Pattern  Substitution 

The  run  length  technique  is  a  primitive  form  of  a  class 
of  techniques  known  as  pattern  substitution,  in  which  codes 
are  substituted  for  specific  character  patterns.  Data  files 
often  contain  repeating  patterns,  such  as  illustrated  in 
figure  3.  These  may  include  numeric  and  alphabetic  informa- 
tion combined  with  or   in  addition  to  null  characters. 

Original Data:

AE10004MFQ00000F320006BCX4
AE20000DBF00000F300000BCX1
AE30002RBA00000F301214BCX7

Pattern Table:

AE       =  #
000      =  $
00000F3  =  %
BCX      =  @

Compressed Data:

#1$4MFQ%2$6@4
#2$0DBF%$00@1
#3$2RBA%01214@7


Figure 3    Pattern Substitution
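
The substitution shown in Figure 3 can be sketched in Python using the pattern table from the figure. The matching rule used here, a left to right scan that takes the longest pattern matching at each position, is an assumption; the source does not state how overlapping patterns are resolved. The function names are hypothetical.

```python
PATTERNS = {"AE": "#", "000": "$", "00000F3": "%", "BCX": "@"}   # table from Figure 3

def substitute(record, table=PATTERNS):
    """Scan left to right, replacing the longest pattern that matches at each position."""
    by_length = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(record):
        for pattern in by_length:
            if record.startswith(pattern, i):
                out.append(table[pattern])
                i += len(pattern)
                break
        else:                       # no pattern matched: copy the character unchanged
            out.append(record[i])
            i += 1
    return "".join(out)

def restore(compressed, table=PATTERNS):
    """Expand each substitution code back into its pattern."""
    inverse = {code: pattern for pattern, code in table.items()}
    return "".join(inverse.get(ch, ch) for ch in compressed)

print(substitute("AE10004MFQ00000F320006BCX4"))   # #1$4MFQ%2$6@4
```

The sketch relies on the substitution codes never appearing in the original data, which is the same requirement the text places on the special character in null suppression.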


A  pattern  table  may  be  constructed  either  in  advance  or 
during  the  compression  of  the  data.  The  table  may  be 
transmitted  with  the  data  or  stored  as  a  permanent  part  of 
the  compressor  and  decompressor.  In  the  method  of  De  Main, 
Kloss,  and  Marron  the  pattern  is  stored  with  the  data,  while 
in  the  method  of  Snyderman  and  Hunt*  the  pattern  is  stored 
in the compressor and decompressor. As in null suppression,


*See  reference  23 
the code for the pattern is represented by unused characters
from  the  character  set. 

The statistical properties of the patterns may be
advantageously used to increase the efficiency of the
compression. In the method of Snyderman and Hunt, even though
trial and error was used to select the patterns, the resulting
168 patterns were among the most frequently occurring
pairs of characters in their textual data files. The
frequency of pairs of characters is further exploited by Jewell,
who chose 190 of the most frequently occurring pairs as
candidates for substitution.

The  compression  method  of  Snyderman  and  Hunt  and  that 
of  Jewell  involve  substituting  single  character  codes  for 
specific  pairs  of  characters.  They  differ  primarily  in  the 
way  the  pairs  of  characters  are  selected,  and  secondarily 
in  the  selection  of  the  substitution  code. 

In the method of Snyderman and Hunt two lists of
characters are selected based partly on their frequency of
occurrence in English text. The first list, called the
"master characters", is a subset of the second list called the
"combining characters". In the example given by the authors
there are eight master characters (blank, A, E, I, O, N, T, U) and
21 combining characters (blank, A, B, C, D, E, F, G, H, I, L, M,
N, O, P, R, S, T, U, V, W), giving 8 x 21 = 168 possible pairs.

The  first  step  of  the  compaction  process  involves 
translating  each  character  to  a  hexadecimal  code  between  00 
and  41  leaving  190  contiguous  codes  at  the  end,  42  through 
FF, for the substitution codes. Next, each translated character
is tested, in turn, to determine if it is a master
character.  If  it  is  not  such,  then  it  is  output  as  it  is; 
otherwise,  it  is  used  as  a  possible  first  character  of  a 
pair.  When  a  master  character  has  been  found,  the  next 
character  in  the  input  string  is  tested  to  determine  if  it 
is  a  combining  character.  If  it  is,  then  the  code  for  the 
pair  is  calculated  and  replaces  both  of  the  input  charac- 
ters. If  the  next  character  is  not  a  combining  character 
then  the  translated  hexadecimal  representations  for  both  are 
each  moved  to  the  output  stream.  Figure  4  contains  a  table 
of  the  compacted  code,  using  this  scheme. 
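
The compaction steps just described can be sketched in Python. Several details are assumptions made only for this illustration: the single character translation table (ALPHABET) is hypothetical, and the pair code is computed as a base value of hexadecimal 58 plus 21 times the position of the master character plus the position of the combining character. That formula is consistent with the legible base values of Figure 4 (blank = 58, A = 6D, E = 82, and so on) but is not stated in the text. Input is assumed to be upper case and drawn from ALPHABET.

```python
MASTER = " AEIONTU"                    # 8 master characters from the text
COMBINING = " ABCDEFGHILMNOPRSTUVW"    # 21 combining characters from the text
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-"   # hypothetical translation table
SINGLE = {ch: code for code, ch in enumerate(ALPHABET)}  # single codes fall below 0x42
PAIR_BASE = 0x58                       # assumed: first pair code (blank + blank)

def compact(text):
    """Replace master + combining pairs with one code; output all other characters singly."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else None
        if ch in MASTER and nxt is not None:
            if nxt in COMBINING:
                out.append(PAIR_BASE + 21 * MASTER.index(ch) + COMBINING.index(nxt))
            else:                      # both characters are moved to the output unpaired
                out.extend([SINGLE[ch], SINGLE[nxt]])
            i += 2
        else:
            out.append(SINGLE[ch])
            i += 1
    return out

def expand(codes):
    """Invert compact: codes at or above PAIR_BASE expand to two characters."""
    chars = []
    for code in codes:
        if code >= PAIR_BASE:
            m, c = divmod(code - PAIR_BASE, 21)
            chars.append(MASTER[m] + COMBINING[c])
        else:
            chars.append(ALPHABET[code])
    return "".join(chars)

codes = compact("THE DATA IS IN THE TABLE")
print(len(codes), expand(codes))
```

Because every code fits in one byte, decoding needs only a comparison against PAIR_BASE and a division by 21.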

[Figure 4: table of the compacted code. The table has four groups of
columns: master characters with their hexadecimal base values, combining
characters with their hexadecimal codes, noncombining characters with
their hexadecimal codes, and combined pairs with their hexadecimal codes.
The legible base values of the master characters are: blank = 58,
A = 6D, E = 82, I = 97, O = AC, N = C1, T = D6.]


Figure 4    Compacted Code

Using the technique described, the Science Information
Exchange compacted the text portion of a 200,000 record
online file from an average of 851 to 553 characters per
record, a decrease of 35 percent. Using an IBM 360/40, the
compression takes 73 ms for 1000 characters while expansion
takes only 65 ms. The extent to which the decrease was due
to null suppression cannot be determined from the authors'
report. Such a determination would be necessary before an
accurate comparison between methods can be made.

The  method  of  Jewell  takes  into  account  the  full  190 
most  frequently  occurring  character  pairs  in  his  sample, 
thus  taking  advantage  of  the  availability  of  the  190  unused 
codes  in  an  8-bit  representation.  Figure  5,  compiled  by 
Jewell,  is  a  2-character  frequency  distribution  of  the  25 
most  frequently  occurring  pairs  in  a  sample  of  text.  The 
190  pairs  are  entered  into  a  table  which  forms  a  semi- 
permanent part  of  the  compaction  process.  The  first  step  of 
the  process  involves  shifting  the  first  two  characters  of 
the  input  stream  into  a  register.  If  this  pair  occurs  in 
the  combination  table  then  a  code  is  substituted  for  the 
pair.  The  code  is  the  address  of  the  pair  in  the  table. 
Two  new  characters  are  then  entered  and  the  process  resumes 
as  in  the  beginning.  If  the  input  pair  is  not  in  the  table 
then  the  first  character  of  that  pair  is  translated  to  a 
value greater than hexadecimal BD (decimal 189, the last
address in the 190-entry table) and sent to the output stream. One new
character  is  shifted  in  with  the  remaining  second  character 
and  the  process  resumes. 
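
The register procedure described above can be sketched in Python. The pair table below is hypothetical: it holds only a handful of the pairs that are legible in Figure 5, whereas Jewell's table holds 190 pairs. The single character alphabet and the function names are likewise assumptions for this illustration.

```python
PAIR_TABLE = ["TH", "RE", "IN", "HE", "ER", "ON", "TI", "AN", "AT", "TE", "OR"]  # hypothetical subset
TABLE_SIZE = 190                           # codes 0 through 189 are reserved for pairs
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # assumed single character set

def compact(text):
    """Substitute the table address for each pair found; shift by one character otherwise."""
    address = {pair: addr for addr, pair in enumerate(PAIR_TABLE)}
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in address:
            out.append(address[pair])      # the code is the address of the pair in the table
            i += 2
        else:
            out.append(TABLE_SIZE + ALPHABET.index(text[i]))   # value above the pair codes
            i += 1
    return out

def expand(codes):
    """Invert compact: codes below TABLE_SIZE are pairs, the rest single characters."""
    return "".join(PAIR_TABLE[c] if c < TABLE_SIZE else ALPHABET[c - TABLE_SIZE]
                   for c in codes)

print(compact("THE RENT"))   # [0, 195, 190, 1, 204, 210]
```

Unlike the method of Snyderman and Hunt, any pair may be entered in the table, so the table must be searched rather than the code being computed from the two characters.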

Rank    Combination    Occurrences    Occurrences
                                      per Thousand

  1         --             328           26.89
  2         ƀT             292           23.94
  3         TH             249           20.41
  4         ƀA             244           20.00
  5         Sƀ             217           17.79
  6         RE             200           16.40
  7         IN             197           16.15
  8         HE             183           15.00
  9         ER             171            --
 10         --             156           12.79
 11         --             153           12.54
 12         --             152           12.46
 13         --             148           12.13
 14         --             141           11.56
 15         ON             140           11.48
 16         --             137           11.23
 17         TI             137           11.23
 18         AN             133           10.90
 19         --             133           10.90
 20         AT             119            9.76
 21         TE             114            9.35
 22         --             113            9.26
 23         --             113            9.26
 24         OR             112            9.18
 25         --             109            8.94

(in the above ƀ = blank; -- = entry not legible)

Partial results of a 2-character frequency test
The text size is 12198 characters
Figure 5.


2.3     Statistical  Encoding 

Statistical encoding is another class of data compression
methods which may be used by itself or combined with a
pattern substitution technique. Statistical encoding takes
advantage of the frequency distribution of characters so
that short representations are used for characters that
occur frequently, and longer representations are used for
characters that occur less frequently. When combined with
pattern substitution, short representations may be used for
some frequently occurring pairs or other groups of
characters. Morse code, for example, uses short code groups for
the common letters, and longer code groups for the others.

When  binary  ones  and  zeros  are  used  to  represent  a  mes- 
sage in  variable  length  codes,  there  must  be  a  way  to  tell 
where  one  character  or  pattern  ends  and  the  other  begins. 
This  can  be  done  if  the  code  has  the  prefix  property,  which 
means that no short code group is duplicated as the
beginning of a longer group. Huffman codes have the prefix
quality and in addition are minimum redundancy codes, that is
they  are  optimal  in  the  sense  that  data  encoded  in  these 
codes  could  not  be  expressed  in  fewer  bits. 
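
The practical value of the prefix property is that a bit stream can be decoded from left to right with no separators and no backtracking. The following Python sketch illustrates this with the code groups of Figure 6; the function names encode and decode are assumptions made for the illustration.

```python
CODE = {"E": "0", "T": "10", "4": "110", "B": "1110", "A": "1111"}   # code groups of Figure 6

def encode(message, code=CODE):
    """Concatenate the code groups of the characters with no separators."""
    return "".join(code[ch] for ch in message)

def decode(bits, code=CODE):
    """Read bits left to right; emit a character as soon as a code group is complete."""
    inverse = {group: ch for ch, group in code.items()}
    out, group = [], ""
    for bit in bits:
        group += bit
        if group in inverse:       # safe only because no group is a prefix of another
            out.append(inverse[group])
            group = ""
    if group:
        raise ValueError("bit stream ends inside a code group")
    return "".join(out)

bits = encode("BEAT4")
print(bits)            # 11100111110110
print(decode(bits))    # BEAT4
```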

Figure  6  shows  the  combinatorial  techniques  used  to 
form  Huffman  codes.  The  characters,  listed  in  descending 
order  of  frequency  of  occurrence,  are  assigned  a  sequence  of 
bits  to  form  codes  as  follows.  The  two  groups  with  the  smal- 
lest frequencies  are  selected  and  a  zero  bit  is  assigned  to 
one  and  a  one  bit  is  assigned  to  the  other.  These  values 
will  ultimately  be  the  value  of  the  right  most  bit  of  the 
Huffman  code.  In  this  case,  the  right  most  bit  of  A  is  1, 
while  that  of  B  is  0,  but  the  values  of  the  bit  assignments 
could  have  been  interchanged.  Next,  the  two  groups,  A  and 
B,  are  then  treated  as  if  they  were  but  one  group, 
represented  by  BA,  and  will  be  assigned  a  specific  value  in 
the  second  bit  position.  In  this  way  both  A  and  B  receive 
the  same  assignment  in  the  second  bit  position.  The  above 
process  is  now  repeated  on  the  list  E,T,4,BA,  where  BA 
represents  groups  A  and  B,  and  has  frequency  of  10%.  The 
two  least  frequently  occurring  groups,  represented  by  4  and 
BA,  are  selected,  and  a  zero  bit  is  assigned  to  character  4 
and  a  one  bit  is  assigned  to  BA.  These  values  will  be  the 
values  of  the  second  bit  from  the  right  of  the  Huffman  code. 
The  partial  code  assembled  up  to  this  point  is  represented 
in  the  step  2  column  of  Figure  6.  In  each  of  steps  3  and  4 
the  process  is  repeated,  each  time  forming  a  new  list  by 
identifying  the  two  elements  of  the  previous  list  which  had 
just  been  assigned  values,  and  then  assigning  zero  and  a  one 
bit  to  the  two  least  frequently  occurring  elements  of  the 
new  list.  In  this  example,  messages  written  in  the  Huffman 
codes  require  only  1.7  bits  per  character  on  the  average, 
whereas three bits would be required in the fixed length
representations.  The  synthesis  of  Huffman  codes  will  be 
discussed  in  greater  detail  in  the  next  section. 




Character   Frequency   Step 1   Step 2   Step 3   Huffman Code
                                                     (Step 4)

    E          60 %                                    0
    T          20 %                          0         10
    4          10 %                 0        10        110
    B           6 %        0        10       110       1110
    A           4 %        1        11       111       1111

Figure 6    Formation of Huffman Code
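
The combinatorial procedure of Figure 6 can be sketched in Python with a priority queue. The bit assignment used here (a one bit to the least frequent group, a zero bit to the next, with a previously merged group treated as the less frequent on ties) is an assumption chosen so that the sketch reproduces the figure; as the text notes, the assignments could be interchanged. The function name huffman is hypothetical.

```python
import heapq
from itertools import count

def huffman(freqs):
    """freqs maps symbol -> integer frequency; returns symbol -> bit string."""
    codes = {sym: "" for sym in freqs}
    # Heap entries are (frequency, tie-breaker, symbols in the group).
    heap = [(f, i, [sym]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    merged = count(-1, -1)     # merged groups sort ahead of single symbols on ties
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)     # least frequent group receives a 1 bit
        f0, _, high = heapq.heappop(heap)    # next least frequent receives a 0 bit
        for sym in low:
            codes[sym] = "1" + codes[sym]    # new bits are placed to the left
        for sym in high:
            codes[sym] = "0" + codes[sym]
        heapq.heappush(heap, (f0 + f1, next(merged), high + low))
    return codes

freqs = {"E": 60, "T": 20, "4": 10, "B": 6, "A": 4}    # percentages from Figure 6
codes = huffman(freqs)
average = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes)     # {'E': '0', 'T': '10', '4': '110', 'B': '1110', 'A': '1111'}
print(average)   # 1.7
```

The average of 1.7 bits per character agrees with the figure quoted in the text, against 3 bits for a fixed length code over five characters.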


2.4     Telemetry  Compression 

Telemetry  compression  techniques  are  not  applicable  to 
most  data  files.  In  telemetry,  a  sensing  device  records 
measurements  at  regular  intervals.  The  measurements  are 
then  transmitted  to  a  more  central  location  for  further  pro- 
cessing. Compression  is  applied  prior  to  transmission  to 
reduce  the  total  amount  of  data  to  be  transmitted.  Telemetry 
data is generally a sequence of numeric fields. In the
sequence there are subsequences or runs of numeric fields with
values that vary only slightly from each other. Compression
is achieved by coding each field, other than the first, with
the incremental difference between it and the preceding
field, provided the absolute value of the increment is less
than some predetermined value. Otherwise, the field is
represented as it is with some escape character to indicate
that the particular field is not coded. The conditions that
make the incremental coding technique effective, the
existence of long runs of similarly valued fields, do not exist
in most data files.
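
The incremental coding just described can be sketched in Python. The escape marker, the bound on the increment, and the function names are all hypothetical choices made for this illustration, and the input is assumed to contain at least one field. In a real system the saving comes from storing each increment in fewer bits than a full field; the sketch shows only the logic.

```python
ESCAPE = "ESC"   # hypothetical escape marker for a field that is not coded
LIMIT = 8        # hypothetical bound: increments with absolute value below this are coded

def delta_encode(fields, limit=LIMIT):
    """Code each field after the first as its difference from the preceding field."""
    out = [fields[0]]
    for prev, cur in zip(fields, fields[1:]):
        diff = cur - prev
        if abs(diff) < limit:
            out.append(diff)
        else:                          # increment too large: send the field as it is
            out.extend([ESCAPE, cur])
    return out

def delta_decode(stream):
    """Rebuild the fields by accumulating increments and honoring escapes."""
    it = iter(stream)
    fields = [next(it)]
    for item in it:
        if item == ESCAPE:
            fields.append(next(it))
        else:
            fields.append(fields[-1] + item)
    return fields

readings = [1000, 1002, 1001, 1005, 2000, 2003]
print(delta_encode(readings))   # [1000, 2, -1, 4, 'ESC', 2000, 3]
```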

3.     Analysis of Data Compression


Data  compression  may  be  represented  as  the  application 
of  some  function  to  elements  of  the  data  base.  If  we  let  x 
be a specified element of the data base, then the compression
of x is y = f(x).

Here,  x,  the  element  of  the  data  base,  may  be  a  string 
of  one  or  more  bits,  bytes,  characters,  pairs  or  n-tuples  of 
characters,  words,  or  text  fragments,  f  is  a  function  that 
maps  the  element  x  into  some  other  element  y.  The  domain  of 
a  function  is  that  set  upon  which  the  function  operates, 
while the range is that set whose elements are the results
of the function operation. The different compression
techniques may be characterized by the choice of the domain,
range, and operation of the function f.


Usually f is invertible, which means that the original
data may be recovered from the compressed data. However, in
some applications, a non-invertible choice of f may be
advantageous. For example, when the data base to be compressed
consists of record identification keys, only an abbreviated
form of each key may be necessary to retrieve each record.
In that case a non-invertible compression technique that
removes some of the information from each key would generate a
more compressed key file than one that was invertible.

In the method of Snyderman and Hunt the domain of f was the
collection of pairs of characters. The range of f was the
collection of bytes, and f was invertible. The definitions of
the domain and range for the other methods are summarized in
Table I.

It appears that compression techniques may be classified in
terms of the type of domain, range and operation. Of the
methods surveyed, the domain was composed of either fixed
length or variable length elements. The range, except for those
techniques that generate Huffman codes, was composed of fixed
length elements. To generate Huffman codes, the function maps
the domain into elements whose length decreases as the
frequency of occurrence of the element in the domain increases
(roughly in proportion to the negative logarithm of its
probability).

In some cases the methods differ only in the function
definition. The difference between the method of Snyderman and
Hunt and the one for Huffman code with patterns is that in the
first case the function maps characters and pairs into bytes
while in the latter case the function maps these same elements
into variable length fields.
                              Table I

      Domain and Range of a Sample of Data Compression Methods

| Method                     | Domain              | Range                           |
|----------------------------|---------------------|---------------------------------|
| Snyderman & Hunt           | pairs of characters | bytes                           |
| Schieber & Thomas          | pairs of characters | bytes                           |
| Jewell                     | pairs of characters | bytes                           |
| Lynch                      | pairs of characters | fixed length fields             |
| Hahn                       | characters          | three fields: two are fixed     |
|                            |                     | length, other is multiple words |
| Ling & Palermo             | fixed length fields | fixed length fields             |
| Schuegraf & Heaps          | text fragments      | fixed length fields             |
| Huffman code with patterns | pairs of characters | variable length binary strings  |

The performance of these methods, chosen somewhat arbitrarily
to represent a cross sample of the data compression methods in
the literature, differs both in terms of percent reduction and
computation time. As one may suspect, the more complex methods,
such as the Huffman code generators, require more computation
time than the simpler methods like that of Snyderman and Hunt.
The Huffman code method did obtain a greater percent reduction
than the others, so the increased computation time may be
worthwhile for some applications. On the other hand, the text
fragment method of Schuegraf and Heaps takes a significantly
longer computation time to accomplish roughly the same degree
of compression as the simpler digraph methods. Table II
contains a summary of the published performance of some data
compression methods. Notice that the measure of performance in
the table is the reduction of storage space. Later in the
paper, that measure will be shown to be unreliable when
compared to the measure of entropy of the data.
                              Table II

          Published Results of Some Compression Techniques

| Method                     | % Reduction | Data Base                                  |
|----------------------------|-------------|--------------------------------------------|
| Snyderman & Hunt           | 35          | Smithsonian Scientific Information         |
|                            |             | Exchange, 171,000,000 characters           |
| Jewell                     | 47          | 12000 char text [24]                       |
| Schieber & Thomas          | 43.5        | 40,000 bibliographic records, average of   |
|                            |             | each is 535 char, total of 21,400,000 char |
| (name not legible)         | 36 to 46    | Institute of Elect. Eng. and British       |
|                            |             | National Bibl. MARC system [17]            |
| Ling & Palermo             | 50          | not specified                              |
| Schuegraf & Heaps          | 35          | MARC Tapes, Issue 1                        |
| Huffman Code with Patterns | 62          | Insurance Company Files                    |

While the compression methods described in the Schuegraf and
Heaps paper have limited utility because, as noted above, their
complexity does not make them more effective than the simpler
digraph methods, the discussion of variable length text
fragments in that paper leads to a related question about the
structure of the data base: what form should the dictionary
take? Inverted-file retrieval systems using free text data
bases commonly identify words as keys or index terms about
which the file is inverted, and through which access is
provided. The words of natural language exhibit a Zipfian*
rank-frequency relationship in which a small number of words
account for a large proportion of word occurrences, while a
large number of words occur infrequently. The inverted-file
system involves large and growing dictionaries and thus may
entail inefficient utilization of storage because of these
distribution characteristics. It may be advantageous to
consider the formation of keys for file inversion from units
other than words. In particular, if variable length text
fragments are chosen as keys, then the above compression method
may be a powerful method of conserving space in inverted-file
systems. A related paper by Clare, Cook, and Lynch [4]
discusses the subject of variable length text fragments in
greater detail.


3.1     Noiseless  Coding  Problem 

Most of the compression methods described in the literature are
approximations to the solution of the noiseless coding problem,
which is described as follows. A random variable X takes on
values $x_1, \ldots, x_M$ with probabilities $p_1, \ldots, p_M$
respectively. Code words $w_1, \ldots, w_M$ of lengths
$n_1, \ldots, n_M$ respectively are assigned to the symbols
$x_1, \ldots, x_M$. The code words are combinations of
characters taken from a code alphabet $a_1, \ldots, a_D$ of
length D. The problem is to construct a uniquely decipherable
code which minimizes the average code-word length

$$\bar{n} = \sum_{i=1}^{M} p_i n_i .$$

Such codes will be called optimal in this paper. Usually the
alphabet consists of the symbols 0 and 1. The problem may be
approached in three steps. First we establish a lower bound on
$\bar{n}$; then we find out how close we can come to that lower
bound; then we synthesize the best code. We shall indicate to
what degree the various compression methods are attempts to
synthesize the best code.

*  The Zipf distribution is a hyperbolic distribution in which
the probability of occurrence of a word is inversely
proportional to the rank of the word. If r is the rank of a
word, then the probability p is defined by $p(r) = k/r$, where
k is a constant chosen so that $\sum_{r=1}^{N} p(r) = 1$.
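
The normalization in the footnote can be checked numerically. The sketch below assumes a hypothetical vocabulary of 10,000 ranked words (a figure chosen only for illustration) and computes the constant k so that the probabilities sum to one; the function name `zipf_probabilities` is likewise an assumption.

```python
def zipf_probabilities(n_words):
    """Return p(r) = k / r for r = 1..n_words, with k chosen so the sum is 1."""
    harmonic = sum(1.0 / r for r in range(1, n_words + 1))
    k = 1.0 / harmonic
    return [k / r for r in range(1, n_words + 1)]

probs = zipf_probabilities(10000)      # assumed vocabulary size
top_100_share = sum(probs[:100])       # share of occurrences from the 100 top-ranked words
```

Under these assumptions the 100 highest-ranked words, one percent of the vocabulary, account for roughly half of all word occurrences, which is the skew the text refers to.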
3.1.1  Uniquely Decipherable Codes. What is a uniquely
decipherable code? For example, consider the following binary
code:

     $x_1$    0
     $x_2$    010
     $x_3$    01
     $x_4$    10

The binary sequence 010 could correspond to any one of the
three messages $x_2$, $x_3 x_1$, or $x_1 x_4$. Since the
sequence 010 cannot be decoded unambiguously, the following
definition is needed to establish a rule to avoid such
sequences.

A code is uniquely decipherable if every finite sequence of
code characters corresponds to at most one message.

One way to ensure unique decipherability is to require that no
code word be a prefix of another code word. If A, B, and C are
finite sequences of code characters, then the juxtaposition of
A and C, written AC, is the sequence formed by writing A
followed by C. The sequence A is a prefix of the sequence B if
B may be written as AC for some sequence C.

Codes which have the above property, namely that no code word
is a prefix of another code word, are called instantaneous
codes. The code below is an example of an instantaneous code.

     $x_1$    0
     $x_2$    100
     $x_3$    101
     $x_4$    11

Notice that the sequences 11111, 10101, or 1001 do not
correspond to any message; so such sequences should never
appear and can be disregarded. The commonly used ASCII and
EBCDIC codes are also instantaneous; but they are such because
of their fixed length, since all fixed length codes are
instantaneous. Every instantaneous code is uniquely
decipherable, but not conversely. To see this, for a given
finite sequence of code characters of an instantaneous code,
proceed from left to right until a code word W is formed. If
no such word can be formed, then the unique decipherability
condition is vacuously satisfied. Since W is not the prefix of
any code word, W must correspond to the first symbol of the
message. Continuing until another code word is formed, and so
on, this process may be repeated until the end of the message.
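
The prefix condition and the left-to-right decoding procedure can be sketched as follows. The function names are illustrative assumptions. The dictionary `instantaneous` is the four-word example code above; `ambiguous` is our reading of the earlier code under which 010 has three interpretations (words 0, 010, 01, 10), and should be treated as an assumption.

```python
def is_instantaneous(code_words):
    """True if no code word is a prefix of a different code word."""
    words = list(code_words)
    return not any(a != b and b.startswith(a) for a in words for b in words)

def decode(bits, code):
    """Decode left to right, emitting a symbol as soon as a code word is formed.

    Only meaningful for an instantaneous (prefix-free) code.
    """
    by_word = {word: symbol for symbol, word in code.items()}
    message, current = [], ""
    for bit in bits:
        current += bit
        if current in by_word:
            message.append(by_word[current])
            current = ""
    if current:
        raise ValueError("sequence does not correspond to any message")
    return message

instantaneous = {"x1": "0", "x2": "100", "x3": "101", "x4": "11"}
ambiguous = {"x1": "0", "x2": "010", "x3": "01", "x4": "10"}   # assumed reading

decoded = decode("0100101110", instantaneous)
```

Sequences such as 1001, which the text notes correspond to no message, leave an unfinished code word at the end and raise an error in this sketch.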

The term instantaneous refers to the fact that the code may be
deciphered step by step. If, when proceeding left to right, W
is the first word formed, we know immediately that W is the
first word of the message. In a uniquely decipherable code
which is not instantaneous, the decoding process may have to
continue for a long time before the identity of the first word
is known. For example, if in the code

     $x_1$    0
     $x_2$    00...01   (n characters)

we received the sequence of n+1 characters 00...001, we would
have to wait until the end of the sequence to find out that
the first symbol is $x_1$. Fortunately, the solution to the
noiseless coding problem can be realized with an instantaneous
code. Notice that while the ASCII and EBCDIC codes are
instantaneous, they are usually far from optimal.

3.1.2  Optimal Codes. The degree of optimality of the code is
measured by the entropy of the message or text. The entropy
H(X) is defined as

$$H(X) = -\sum_{i=1}^{M} p_i \log_2 p_i$$

where $p_1, \ldots, p_M$ are the probabilities of the message
symbols as defined in the above description of the noiseless
coding problem.

The following theorem gives the lower bound to the average
length $\bar{n}$ of the code.

(Noiseless Coding Theorem) If $\bar{n} = \sum_{i=1}^{M} p_i n_i$
is the average code word length of a uniquely decipherable code
for the random variable X, then $\bar{n} \ge H(X)/\log D$, with
equality if and only if $p_i = D^{-n_i}$ for
$i = 1, \ldots, M$. Note that $H(X)/\log D$ is the uncertainty
of X using logarithms to the base D, that is,
$H(X)/\log D = -\sum_{i=1}^{M} p_i \log_D p_i$.

For the environment we are interested in, the coding alphabet
is binary, so D = 2. Thus the lower bound is simply
$\bar{n} \ge H(X)$. H(X) is not only the lower bound to the
length of the code needed to represent the data, it also
provides a measure of the improvement that may be expected by
compressing the data. The comparison of the value of H(X) to
the current average code size, which is 8 for ASCII or EBCDIC,
gives a measure of the improvement that can be realized by
compressing the data. If H(X)=8 then no compression is
realizable by coding the data differently; if H(X)=5 then up to
an 8 to 5 compression ratio may be obtained. The comparison of
the improvement realized by a specific data compression
technique to the theoretic improvement given by the above ratio
can serve to evaluate the effectiveness of the technique. The
measure of effectiveness usually given, the file length before
and after compression, does not indicate the true level of
compression, since the compression may have been due mainly to
null suppression.

Any code that achieves the lower bound of the noiseless coding
theorem is called absolutely optimal. The following code is an
example of an absolutely optimal code.

     X        Probabilities    Code Words
     $x_1$    1/2              0
     $x_2$    1/4              10
     $x_3$    1/8              110
     $x_4$    1/8              111

     $H(X) = \bar{n} = 7/4$
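
As a check on the example, the following sketch computes the entropy and the average code-word length for the probabilities 1/2, 1/4, 1/8, 1/8 (which sum to one) and the code words 0, 10, 110, 111. The helper names `entropy` and `average_length` are assumptions made for the illustration.

```python
from fractions import Fraction
from math import log2

def entropy(probs):
    """H(X) = -sum of p_i * log2(p_i), in bits per symbol."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

def average_length(probs, code_words):
    """Average code-word length: sum of p_i * n_i."""
    return sum(p * len(w) for p, w in zip(probs, code_words))

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
code_words = ["0", "10", "110", "111"]

h = entropy(probs)                          # entropy in bits
n_bar = average_length(probs, code_words)   # exact rational average length
```

Both quantities come out to 7/4 because every probability is an exact power of 1/2, which is the equality condition $p_i = D^{-n_i}$ of the theorem with D = 2.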

In a previous example of a Huffman code, figure 6, the average
code length of the Huffman code was 1.7 bits per character,
while the value of the entropy H(X) was approximately 1.15 bits
per character. That example illustrates the general
impossibility of constructing an absolutely optimal code for
arbitrary collections of characters. That example also
illustrates that the average length achieved by any coding
method is bounded below by the value of H(X).


3.2     Realization  of  Optimal  Codes 


While the theorem states the condition under which a code is
absolutely optimal, in general the construction of such a code
for an arbitrary set of probabilities is impossible. For a
given set of probabilities $p_1, \ldots, p_M$, if the code is
to be absolutely optimal, the lengths of the code words must be
chosen to satisfy $p_i = D^{-n_i}$, which is the same as

$$n_i = \frac{-\log p_i}{\log D} .$$

Obviously the value of $n_i$ that satisfies the above condition
need not be an integer. However we may do the next best thing
by choosing the integer $n_i$ to satisfy the inequalities:

$$\frac{-\log p_i}{\log D} \le n_i < \frac{-\log p_i}{\log D} + 1 .$$

An instantaneous code can be shown to exist in which the code
lengths satisfy the above inequality. The following theorem
characterizes such codes.

Given a random variable X with uncertainty H(X), there exists a
base D instantaneous code for X whose average code-word length
$\bar{n}$ satisfies

$$\frac{H(X)}{\log D} \le \bar{n} < \frac{H(X)}{\log D} + 1 .$$

For a proof see Ash, page 39.
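
A small numerical sketch of this choice of lengths follows. The probabilities are borrowed from the five-character example used later for the Huffman code; the function name `shannon_lengths` and the small rounding guard are assumptions. The Kraft sum computed here is a standard existence test for instantaneous codes with given lengths and is not named in the text.

```python
from math import ceil, log, log2

def shannon_lengths(probs, D=2):
    """Smallest integers n_i with n_i >= -log(p_i) / log(D)."""
    # The tiny offset guards against floating-point results like 2.0000000001.
    return [ceil(-log(p) / log(D) - 1e-12) for p in probs]

probs = [0.3, 0.3, 0.2, 0.15, 0.05]
lengths = shannon_lengths(probs)

kraft_sum = sum(2 ** -n for n in lengths)            # must not exceed 1
h = -sum(p * log2(p) for p in probs)                 # entropy in bits
n_bar = sum(p * n for p, n in zip(probs, lengths))   # average code-word length
```

For these probabilities the average length falls between H(X) and H(X) + 1, as the theorem requires, but it is longer than the average length of the Huffman code built for the same probabilities in section 3.3.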


This theorem says that the average code-word length can be
brought to within one digit of the lower bound set by the
noiseless coding theorem. That lower bound may be approached
arbitrarily closely if block coding is used. The success of the
digram coding schemes is due to the fact that block coding of
length 2 is used. Block coding works as follows. Instead of
assigning a code word to each symbol $x_i$, we assign a code
word to each group of s symbols. In other words, we construct a
code for the random vector $Y = (X_1, X_2, \ldots, X_s)$, where
the $X_i$ are independent and each $X_i$ has the same
distribution as X. If each $X_i$ assumes M values then Y
assumes $M^s$ values. The following example illustrates the
decrease in the average code-word length by block coding.
     X        p      Code Word        $Y = (X_1, X_2)$    p       Code Word
     $x_1$    3/4    0                $x_1 x_1$           9/16    0
     $x_2$    1/4    1                $x_1 x_2$           3/16    10
                                      $x_2 x_1$           3/16    110
                                      $x_2 x_2$           1/16    111

For the single-symbol code, $\bar{n} = 1$ code character/value
of X. For the block code,

$$\bar{n} = \frac{9}{16} + \frac{3}{16}(2) + \frac{1}{4}(3) = \frac{27}{16}$$

code characters/2 values of X, which is 27/32 code
characters/value of X.
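
The arithmetic of the block-coding example can be reproduced exactly with rational numbers. The sketch assumes, as the text does, that the two symbols of a block are independent; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import log2

p = {"x1": Fraction(3, 4), "x2": Fraction(1, 4)}
single_code = {"x1": "0", "x2": "1"}
pair_code = {
    ("x1", "x1"): "0",
    ("x1", "x2"): "10",
    ("x2", "x1"): "110",
    ("x2", "x2"): "111",
}

# Average length when each value of X is coded separately.
n_single = sum(p[x] * len(single_code[x]) for x in p)

# Average length per block of two independent values, then per value of X.
n_pair = sum(p[a] * p[b] * len(pair_code[(a, b)]) for a, b in product(p, repeat=2))
n_per_value = n_pair / 2

# Entropy of X in bits, the lower bound for any block length.
h = -sum(float(q) * log2(float(q)) for q in p.values())
```

Coding pairs lowers the cost from 1 to 27/32 code characters per value of X, while the entropy of X (a little over 0.81 bits) remains the limit that longer blocks approach.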

By the above theorem, the average code-word length $\bar{n}_s$
for the block of length s satisfies

$$\frac{H(Y)}{\log D} \le \bar{n}_s < \frac{H(Y)}{\log D} + 1$$

code characters/value of Y.
$H(Y) = H(X_1, \ldots, X_s) \le H(X_1) + \cdots + H(X_s)$
whether or not the $X_i$ are independent from each other. If
they are independent, then the inequality becomes an equality.
If the $X_i$ are identically distributed, then
$H(X_1) + \cdots + H(X_s) = sH(X)$. In the classical case, both
independence and identical distribution are assumed, in which
case the average code word length satisfies

$$\frac{sH(X)}{\log D} \le \bar{n}_s < \frac{sH(X)}{\log D} + 1$$

or

$$\frac{H(X)}{\log D} \le \frac{\bar{n}_s}{s} < \frac{H(X)}{\log D} + \frac{1}{s} .$$

While for text files and messages the independence of each
$X_i$ is a tenuous assumption, the assumption that each $X_i$
is identically distributed is credible. Upon dropping the
independence assumption the above inequality becomes

$$\frac{H(X_1, \ldots, X_s)}{s \log D} \le \frac{\bar{n}_s}{s} < \frac{H(X)}{\log D} + \frac{1}{s} .$$

Thus we see that regardless of the independence of the elements
of the block, the upper bound of the average code length may be
made as close to $H(X)/\log D$ as desired by increasing the
block length. On the other hand, the lower limit may be smaller
when the elements of the block are not independent, as is
frequently the case in text files. Thus for the conditions
applicable to text files and messages the average code-word
length may be made at least as small as the optimal length
characterized by the noiseless coding theorem. The dependence
of characters in text files may explain why the simple digraph
methods are so successful. That dependence is further exploited
in the method of Wagner which substitutes codes for entire
English phrases.
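
The effect of dependence between neighbouring characters can be seen on a small sample. The sketch below is only an illustration: the sample string is invented, blocks are taken as non-overlapping pairs (s = 2), and the entropies are computed from the empirical frequencies in that sample.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical sample text; any character data could be used instead.
text = "the theory of the thing is that the other method then fails "
pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]   # blocks of length 2

h_pair = entropy(Counter(pairs))                     # H(X1, X2)
h_first = entropy(Counter(p[0] for p in pairs))      # H(X1)
h_second = entropy(Counter(p[1] for p in pairs))     # H(X2)
bits_per_char_blocked = h_pair / 2                   # lower limit per character
```

Because H(X1, X2) never exceeds H(X1) + H(X2), `bits_per_char_blocked` is at most the average of the two single-character entropies, and it is smaller whenever the characters in a pair are dependent.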


3.3     Synthesis  of  the  Huffman  Code 

So far only the existence of optimal codes has been discussed;
now the synthesis of one such code, the Huffman code, will be
illustrated. For the synthesis of optimal codes, only the
instantaneous codes need to be considered since if a code is
optimal with respect to the class of instantaneous codes, then
it is also optimal with respect to all uniquely decipherable
codes. This characteristic is indeed fortunate since
instantaneous codes are the codes of choice for data
transmission and processing applications. The precise statement
of this characteristic is as follows.

If a code C is optimal within the class of instantaneous codes
for the given probabilities $p_1, p_2, \ldots, p_M$, which
means that no other instantaneous code for the same given set
of probabilities has a smaller average code-word length than C,
then C is optimal within the entire class of uniquely
decipherable codes.

For a proof see Ash page 40.

An optimal binary code can be characterized by certain
necessary conditions which restrict the choices of code lengths
that may be assigned to each code. These characterizations are
as follows.

Given a binary code C with word lengths $n_1, n_2, \ldots, n_M$
associated with a set of symbols with probabilities
$p_1, p_2, \ldots, p_M$, assume, for convenience, that the
symbols are arranged in order of decreasing probability
($p_1 \ge p_2 \ge \cdots \ge p_M$) and that a group of symbols
with the same probability is arranged in order of increasing
code-word length. (If $p_i = p_{i+1} = \cdots = p_{i+r}$, then
$n_i \le n_{i+1} \le \cdots \le n_{i+r}$.) Then if C is optimal
within the class of instantaneous codes, C must have the
following properties:

a.  Higher probability symbols have shorter code words, that
is, $p_j > p_k$ implies $n_j \le n_k$.

b.  The two least probable symbols have code words of equal
length, that is, $n_{M-1} = n_M$.

c.  Among the code words of length $n_M$ there must be at least
two words that agree in all digits except the last. For
example, the following code cannot be optimal since code words
4 and 5 do not agree in the first three places.

     $x_1$    0
     $x_2$    100
     $x_3$    101
     $x_4$    1101
     $x_5$    1110

For a proof see Ash page 41.

The construction of a Huffman code for the characters
$c_1, \ldots, c_n$ with probabilities $p_1, \ldots, p_n$
respectively, involves generating a binary tree[1] for which
each of the above characters is represented as a terminal node
and the other nodes, the internal nodes, are formed in the
following manner. First, from the two nodes with smallest
probabilities, say $c_1$ and $c_2$, a new node $c_{1,2}$ with
probability $p_1 + p_2$ is formed to be the father of $c_1$ and
$c_2$. Now with the reduced set of n-1 nodes, which consists of
$c_{1,2}, c_3, \ldots, c_n$ with probabilities
$p_1 + p_2, p_3, \ldots, p_n$ respectively, repeat the above
procedure; and continue to repeat it until the reduced set
consists of only two nodes. Now consider the binary tree which
consists of the terminal nodes and all the new nodes formed by
the above process. For each successive pair of branches,
starting at the root, assign the values 0 and 1 to each link of
the branch. The resultant code for each of the characters is
the sequence of assigned values obtained by tracing the tree
from the root to each of the terminal nodes. Each aggregate
causes the items so chosen to have a code length of one more
binary digit; so the average length is minimized by giving this
extra digit to the least probable clump. The following example
illustrates the method.

[1] A binary tree is a graph which consists of a root node and
descendant nodes. From the root node are links to at most two
other nodes, the descendants of the root node. Each of these
descendants, in turn, is linked to no more than two other
nodes; and these latter nodes may be similarly linked to other
nodes, and so on.

Let the characters be $c_1, c_2, c_3, c_4, c_5$ and have
probabilities .3, .3, .2, .15, .05, respectively. In the tree
which results from the above method, the terminal nodes are
represented by squares, the other nodes by circles, and in each
square and circle is the probability of the node.


The Huffman code for each of the characters is:

     Character    Code
     $c_1$        00
     $c_2$        01
     $c_3$        10
     $c_4$        110
     $c_5$        111
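
The merging procedure can be sketched with a priority queue. This is a minimal illustration under stated assumptions rather than the implementation of any referenced paper: it merges all the way down to a single root instead of stopping at two nodes (which yields the same tree), it prepends 0 or 1 to the partial code words at each merge, and the name `huffman_code` is invented for the example. The particular 0/1 labels it produces may differ from the table above, but the code-word lengths agree.

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman_code(probs):
    """Return {symbol: code word} by repeatedly merging the two least probable nodes."""
    tiebreak = count()   # unique counter so equal probabilities never compare dicts
    # Each heap entry: (probability, tiebreak, {symbol: partial code word})
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # least probable node
        p1, _, group1 = heapq.heappop(heap)   # next least probable node
        merged = {s: "0" + w for s, w in group0.items()}
        merged.update({s: "1" + w for s, w in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {
    "c1": Fraction(3, 10), "c2": Fraction(3, 10), "c3": Fraction(1, 5),
    "c4": Fraction(3, 20), "c5": Fraction(1, 20),
}
code = huffman_code(probs)
lengths = {sym: len(word) for sym, word in code.items()}
average_length = sum(probs[s] * lengths[s] for s in probs)
```

For the five characters of the example the code-word lengths are 2, 2, 2, 3, 3 and the average length is 2.2 binary digits per character.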


A variation of the Huffman code, a variable length alphabetic
code, is explained in a paper by Hu and Tucker. There, a tree,
which is optimal in another sense, is obtained which preserves
the original order of the terminal nodes. Using their
algorithm, alphabetical codes may be generated which, though
not as efficient as a Huffman code, enable ordering operations
to be applied to the coded text in the same way as to the
uncoded text.
Observe that for the formation of the Huffman code the
distribution of the characters or blocks must be known in
advance. It may appear that the Huffman code is valid only for
each instance or version of the data so that a new code may
have to be generated for each data base and for each change to
the data base. Fortunately, the distribution of characters is
not that sensitive to changes in the data. One study has shown
that the distribution of characters for a particular data base
is stable over a period of time [18]. Moreover the same
distribution seems to be relatively stable across different
English text data bases. The following graph shows the
distribution of characters in a typical English text.
[Figure: Normal frequency distribution of the letters of the
alphabet (in uses per thousand)]

The following table, from the paper by Lynch, Petrie, and Snell
[16], shows a distribution of characters which is close to that
in the graph.

For a given Huffman code, changes in the average code word
length with respect to changes in the distribution of the
characters may be analyzed in the following way. Let the code
word lengths be $n_1, n_2, \ldots, n_m$, where
$n_1 \le n_2 \le \cdots \le n_m$, and let the probabilities of
the characters be $p_1, p_2, \ldots, p_m$. Suppose that the
i'th probability changes by the amount $d_i$, so that
$p_i' = p_i + d_i$ is the new i'th probability. The new average
code word length is

$$\bar{n}' = \sum_{i=1}^{m} p_i' n_i = \sum_{i=1}^{m} (p_i + d_i) n_i = \bar{n} + \sum_{i=1}^{m} d_i n_i .$$

Let $D = \sum_{i=1}^{m} d_i n_i$. Then, since
$\sum_{i=1}^{m} d_i = 0$, $D = \sum_{i=1}^{m-1} d_i (n_i - n_m)$.
There are two interesting cases to consider. The first occurs
when $d_i \ge 0$ for $i = 1, 2, \ldots, m-1$. Then, since
$n_i - n_m \le 0$, $D \le 0$, so $\bar{n}' \le \bar{n}$. The
second case occurs when $d_i \le 0$ for
$i = 1, 2, \ldots, m-1$. Then $\bar{n}' \ge \bar{n}$. If the
changes $d_i$ are restricted so that

$$|d_i| \le \frac{a^i}{n_m - n_i}$$

then

$$|D| \le \sum_{i=1}^{m-1} |d_i| (n_m - n_i) \le \sum_{i=1}^{m-1} a^i = \frac{a(1 - a^{m-1})}{1 - a} .$$

If $a \le \frac{1}{2}$ then $|D| \le 1 - (\frac{1}{2})^{m-1} < 1$.
It appears that as long as the distribution of characters
changes only slightly from data base to data base, a Huffman
code designed for one of the data bases will be adequate for
the others. Further study of the variation of Huffman codes
with respect to changes in the data base is needed before more
detailed statements can be made about the performance of
Huffman codes when such changes occur.
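
The quantity D = sum of d_i n_i can be evaluated directly for a fixed code. The sketch below reuses the code-word lengths 2, 2, 2, 3, 3 of the five-character Huffman example; the perturbation `deltas` is invented for illustration and falls under the first case discussed above (all changes non-negative except for the last, least probable character).

```python
from fractions import Fraction

def length_change(lengths, deltas):
    """D = sum of d_i * n_i: change in average code-word length for a fixed code."""
    if sum(deltas) != 0:
        raise ValueError("changes must sum to zero so the probabilities still sum to one")
    return sum(d * n for d, n in zip(deltas, lengths))

lengths = [2, 2, 2, 3, 3]                                   # n_1 <= ... <= n_m
probs = [Fraction(30, 100), Fraction(30, 100), Fraction(20, 100),
         Fraction(15, 100), Fraction(5, 100)]
deltas = [Fraction(2, 100), Fraction(1, 100), Fraction(0),
          Fraction(0), Fraction(-3, 100)]                   # hypothetical drift

old_average = sum(p * n for p, n in zip(probs, lengths))
new_average = sum((p + d) * n for p, d, n in zip(probs, deltas, lengths))
D = length_change(lengths, deltas)
```

Here probability moves from the least probable character to the most probable ones, so D is negative and the existing code becomes slightly shorter on average without being rebuilt.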
4.  CONCLUSIONS


Several  types  of  compression  methods  have  been  dis- 
cussed along  with  the  underlying  coding  theory  and  the  meas- 
ures for  evaluating  the  effectiveness  of  a  compression 
method.  It  was  shown  that  the  data  compression  problem  is 
the  same  as  the  optimal  coding  problem  when  the  data  file  is 
considered  as  a  collection  of  independent  characters.  Since 
data  characters  are  generally  not  independent,  the  optimal 
code  may  be  even  shorter  than  that  predicted  by  the  noise- 
less coding  theorem,  thus  possibly  permitting  even  greater 
compression.  A  good  measure  of  the  effectiveness  of  the 
method  is  not  the  percent  reduction,  used  in  some  of  the 
referenced papers, but the ratio of the entropy H(X) of the
data  file  to  the  average  encoded  character  size  in  bits.  If 
the  compression  is  at  least  as  good  as  the  optimal  code  then 
the  ratio  is  greater  than  or  equal  to  1,  otherwise  it  is 
less  than  one. 

The steps to be followed in selecting or determining a need for
a data compression method involve the calculation of the
entropy of the data. These steps are:

1.  Measure H(X), where

$$H(X) = -\sum_{i=1}^{N} p_i \log_2 (p_i) .$$

In the above formula for H(X), $p_i = f_i / F$, where $f_i$ is
the frequency of the i'th type of element of the data file, F
is the total number of elements in the file
($F = \sum_{i=1}^{N} f_i$), and N is the number of distinct
types of elements. As in section 3.1, the data file is composed
of a sequence of elements which are usually characters. In
ASCII data files, there are 128 different types of characters
that may occur in the file; however, since control characters
usually do not occur in a file, most ASCII files will have only
96 possible types of characters. Alternatively H can be
calculated from the equivalent expression

$$H(X) = \log_2 (F) - \frac{1}{F} \sum_{i=1}^{N} f_i \log_2 (f_i)$$

by summing the values $f_i \log_2 (f_i)$ for each character,
dividing by F, and then subtracting the result from
$\log_2 (F)$. For large data files, it is not necessary to base
the calculations on the entire file, but only on part of the
file, say the first 100,000 bytes if the file is homogeneous,
or one can use some random sampling procedure to estimate the
frequencies $f_i$.

2.  Determine the current average character length $\bar{n}$ in
bits. For ASCII and EBCDIC files this value will usually be 8.
If H(X) is much less than $\bar{n}$ then a statistical
compression method will be effective. If, on the other hand,
H(X) is close to $\bar{n}$ then such methods will not be
effective; however, some type of pattern substitution may be
applicable. For example, if H(X)=7 and the current code-word
length is 8 then some improvement would be expected by
compressing the data, but a greater improvement is to be
expected when H(X)=5 and the current length is 8.

3.  If  the  data  is  numerical,  then  a  numerical  method 
such  as  polynomial  predictors  and  polynomial  curve  fitting 
algorithms  [5-9]  may  be  superior  to  the  methods  discussed  in 
this  report. 

4.  If  the  data  is  text  or  a  combination  of  text  and 
numerical  tables,  and  the  data  is  compressible  as  indicated 
in  step  2,  then  either  a  digraph  method  or  a  Huffman  method 
would  compress  the  data.  The  digraph  method  is  much  easier 
to  implement,  and  runs  faster  than  the  Huffman  method,  while 
the  latter  obtains  a  higher  degree  of  compression.  The 
choice of the compression method will depend on the characteristics
and applications of the data. Data files which
contain mostly numeric fields would be compressible by an
entirely different algorithm than would text files. Frequently
accessed files may need an algorithm which runs
quicker than that for less frequently accessed files, even
though the data compression obtained by the faster algorithm
is far less than optimal. Within the same file system, parts
of the file may be more efficiently compressed with different
methods. The dictionary* of an information management
system may be compressed with a simple yet fast algorithm,
while the corresponding data files, because they are infrequently
accessed, may be compressed with a more complex algorithm
which is slower but realizes more compression. A
variable length alphabetic code**, which has some of the optimal
properties of the Huffman code, may be used to
compress the dictionary.

* The dictionary, as used here, refers to the collection
of pointers of an inverted file system. Each pointer,
by pointing to a record of the file, functions in a
manner analogous to a word of an English language
dictionary.
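
To give a feel for what the Huffman method in step 4 produces, the sketch below computes Huffman code-word lengths for the characters of a byte string and the resulting average length in bits per character. It is an illustration under stated assumptions, not the implementation evaluated in this report: the function names are hypothetical, and the code is built from the character counts of the data itself. The digraph method is not sketched here.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {character: code-word length in bits} for a Huffman code."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct character still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, characters in the subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes each of their characters one level down.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


def average_length(data: bytes) -> float:
    """Average Huffman code-word length in bits per character."""
    if not data:
        return 0.0
    lengths = huffman_code_lengths(data)
    counts = Counter(data)
    return sum(counts[s] * lengths[s] for s in counts) / len(data)


print(huffman_code_lengths(b"aaaabbc"))  # {97: 1, 98: 2, 99: 2}
```

Frequent characters receive short code words and rare ones long code words. That is the source of the higher compression, and also of the extra implementation effort and running time compared with a fixed table of digraphs.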

5. The effectiveness of a particular data compression
method can be measured by comparing the average character
length (in bits) of the data file after it has been compressed
to the value of the entropy of the file. If the average character
length, after compression, is close to the value of the entropy,
then the method is as effective as an optimal statistical
compression method. If the value of the average is
still significantly greater than the value of the entropy,
then the data compression method is not as effective as possible.
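
The comparison in step 5 can be sketched as below. This is a hedged illustration: the function names are hypothetical, the entropy is taken to be the zero-order entropy of the single-character distribution, and the compressed size is assumed to be known in bytes.

```python
import math
from collections import Counter


def entropy_bits_per_char(data: bytes) -> float:
    """Zero-order entropy H(X) of the character distribution, in bits."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def effectiveness(original: bytes, compressed_size_bytes: int) -> float:
    """Entropy divided by the average compressed length per character.

    A value near 1.0 means the method is close to an optimal statistical
    code for single characters; a value well below 1.0 means the average
    length is still significantly greater than the entropy.
    """
    if not original or compressed_size_bytes <= 0:
        raise ValueError("need non-empty data and a positive compressed size")
    average_bits = 8 * compressed_size_bytes / len(original)
    return entropy_bits_per_char(original) / average_bits


print(entropy_bits_per_char(b"abab"))  # 1.0
print(effectiveness(b"abab", 1))       # 0.5
```

In the usage lines, four characters with an entropy of 1 bit each are stored in one byte, an average of 2 bits per character, so the method reaches half of what an optimal statistical code would achieve.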

Data compression is relevant to a data processing application
when its use is significant or meaningful to the
user. Its use is warranted when it effects at least one of
the  following: 

1.  Significant  cost  reduction 

2.  Significant  storage  reduction 

3.  Allowing  the  implementation  of  the  application 
which  otherwise  could  not  have  been  implemented 
due  to  insufficient  storage 

4.  A  significant  decrease  in  the  data  transfer 
time.

The notion of what is significant to a user is relative to
the user's environment. To a mini-computer user with limited
disc storage, a reduction of a few thousand bytes of storage
may be significant, while to a large system user such a
reduction would be insignificant. While the ultimate decision
of whether or not data compression is relevant depends
on the user's special requirements and judgement, the following
three guidelines will be applicable in most cases.

1.  If  the  quantity  of  data  is  small,  say  under 
100,000  bytes,  or  if  the  life  of  the  data  is 
short, then data compression would not be advisable.

2.  Large  data  files,  over  100,000  bytes,  the  life 
of  which  is  not  short,  are  good  candidates  for 
data  compression. 

3.  A  group  of  data  files,  where  the  files  have 
similar  character  composition,  is  a  good  candidate 
for  data  compression  when  the  size  of  the  group  is 
more  than  100,000  bytes. 
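
The three guidelines can be read as a simple decision rule, sketched below. The function and parameter names are hypothetical, and the treatment of a size of exactly 100,000 bytes and of what counts as a "short" life are assumptions, since the guidelines leave both to the user's judgement.

```python
THRESHOLD_BYTES = 100_000  # size figure quoted in the guidelines


def is_compression_candidate(file_sizes, long_lived, similar_composition=True):
    """Apply the three guidelines to one file or to a group of files.

    file_sizes: sizes in bytes (a single-element list for one file).
    long_lived: the user's judgement that the life of the data is not short.
    similar_composition: whether the files in a group have similar
        character composition (only relevant for groups).
    """
    if not long_lived:
        return False  # guideline 1: short-lived data
    if any(size > THRESHOLD_BYTES for size in file_sizes):
        return True   # guideline 2: a large file with a long life
    # Guideline 3: a group of similar files whose combined size is large.
    return (similar_composition
            and len(file_sizes) > 1
            and sum(file_sizes) > THRESHOLD_BYTES)


print(is_compression_candidate([250_000], long_lived=True))         # True
print(is_compression_candidate([40_000] * 3, long_lived=True))      # True
print(is_compression_candidate([40_000], long_lived=True))          # False
```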


** See section 3.3.